parse-latin

Latin-script (natural language) parser

Version: 0.2.0

Version published: 10 years ago

Weekly downloads: 468K; increased by 2.2%

Maintainers: 1

Created: 10 years ago

What is parse-latin?

The parse-latin npm package is a JavaScript library used to parse Latin-script natural language into a syntax tree. It is particularly useful for text processing tasks such as tokenization, sentence splitting, and word segmentation.

What are parse-latin's main functionalities?

Tokenization

Splits a text into individual tokens: words, white space, and punctuation. The sample tokenizes a simple sentence.

const ParseLatin = require('parse-latin');
const parser = new ParseLatin();
const tokens = parser.tokenize('This is a sentence.');
console.log(tokens);

Sentence Splitting

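Splits a paragraph into individual sentences. parse() returns a tree in which every sentence is a SentenceNode inside a ParagraphNode, so the sample parses a paragraph and keeps only the sentence nodes.

const ParseLatin = require('parse-latin');
const parser = new ParseLatin();
// parse() returns a RootNode; its first child is the ParagraphNode.
const paragraph = parser.parse('This is a sentence. This is another sentence.').children[0];
const sentences = paragraph.children.filter(node => node.type === 'SentenceNode');
console.log(sentences);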

Word Segmentation

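Segments a sentence into individual words. In the tree returned by parse(), every word is a WordNode inside a SentenceNode, so the sample parses a sentence and keeps only the word nodes.

const ParseLatin = require('parse-latin');
const parser = new ParseLatin();
// Descend RootNode > ParagraphNode > SentenceNode.
const sentence = parser.parse('This is a sentence.').children[0].children[0];
const words = sentence.children.filter(node => node.type === 'WordNode');
console.log(words);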

Other packages similar to parse-latin

compromise
natural

parse-latin

A Latin script language parser producing NLCST nodes.

For semantics of nodes, see NLCST;
For a pluggable system to analyze and manipulate language, see retext.

Whether the text is Old English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), parse-latin does a good job of tokenizing it.

Note also that parse-latin does a decent job of tokenizing Latin-like scripts, such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian (“Շատ հաճելի է”).

Installation

npm:

$ npm install parse-latin

Component:

$ component install wooorm/parse-latin

Bower:

$ bower install parse-latin

Usage

var ParseLatin = require('parse-latin'),
    latin = new ParseLatin();

latin.parse('A simple sentence.');
/**
 * Logs something like:
 * ˅ Object
 *    ˃ children: Array[1]
 *      type: "RootNode"
 *    ˃ __proto__: Object
 */

latin.parse(
    'The \xC5 symbol invented by A. J. A\u030Angstro\u0308m ' +
    '(1814, Lo\u0308gdo\u0308, \u2013 1874) denotes the ' +
    'length 10\u207B\xB9\u2070 m.'
);
/**
 * Logs something like:
 * ˅ Object
 *    ˃ children: Array[1]
 *      type: "RootNode"
 *    ˃ __proto__: Object
 */

API

ParseLatin()

Exposes the functionality needed to tokenize natural Latin-script languages into a syntax tree.

ParseLatin#tokenize(value)

Tokenize natural Latin-script language into letters and numbers (words), white space, and everything else (punctuation).

ParseLatin#parse(value)

Tokenize natural Latin-script languages into an NLCST syntax tree.

var ParseLatin = require('parse-latin'),
    latin = new ParseLatin();

latin.parse('A simple sentence.');
/**
 * Object
 * ├─ type: "RootNode"
 * └─ children: Array[1]
 *     └─ 0: Object
 *           ├─ type: "ParagraphNode"
 *           └─ children: Array[1]
 *              └─ 0: Object
 *                    ├─ type: "SentenceNode"
 *                    └─ children: Array[6]
 *                       ├─ 0: Object
 *                       |     ├─ type: "WordNode"
 *                       |     └─ children: Array[1]
 *                       |        └─ 0: Object
 *                       |              ├─ type: "TextNode"
 *                       |              └─ value: "A"
 *                       ├─ 1: Object
 *                       |     ├─ type: "WhiteSpaceNode"
 *                       |     └─ children: Array[1]
 *                       |        └─ 0: Object
 *                       |              ├─ type: "TextNode"
 *                       |              └─ value: " "
 *                       ├─ 2: Object
 *                       |     ├─ type: "WordNode"
 *                       |     └─ children: Array[1]
 *                       |        └─ 0: Object
 *                       |              ├─ type: "TextNode"
 *                       |              └─ value: "simple"
 *                       ├─ 3: Object
 *                       |     ├─ type: "WhiteSpaceNode"
 *                       |     └─ children: Array[1]
 *                       |        └─ 0: Object
 *                       |              ├─ type: "TextNode"
 *                       |              └─ value: " "
 *                       ├─ 4: Object
 *                       |     ├─ type: "WordNode"
 *                       |     └─ children: Array[1]
 *                       |        └─ 0: Object
 *                       |              ├─ type: "TextNode"
 *                       |              └─ value: "sentence"
 *                       └─ 5: Object
 *                             ├─ type: "PunctuationNode"
 *                             └─ children: Array[1]
 *                                └─ 0: Object
 *                                      ├─ type: "TextNode"
 *                                      └─ value: "."
 */
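
A tree of this shape is easy to walk with a small recursive helper. The following is a minimal sketch, not part of parse-latin: it rebuilds by hand the tree shown above for 'A simple sentence.' so that it runs without the package installed, and the helper names (text, parent, nodeToString, collect) are hypothetical.

```javascript
// Hand-built copy of the tree documented above for 'A simple sentence.'.
// In real use, this object would come from latin.parse(...).
const text = (value) => ({ type: 'TextNode', value });
const parent = (type, value) => ({ type, children: [text(value)] });

const tree = {
  type: 'RootNode',
  children: [{
    type: 'ParagraphNode',
    children: [{
      type: 'SentenceNode',
      children: [
        parent('WordNode', 'A'),
        parent('WhiteSpaceNode', ' '),
        parent('WordNode', 'simple'),
        parent('WhiteSpaceNode', ' '),
        parent('WordNode', 'sentence'),
        parent('PunctuationNode', '.')
      ]
    }]
  }]
};

// Concatenate the values of all TextNodes below a node, in order.
function nodeToString(node) {
  return node.children ? node.children.map(nodeToString).join('') : node.value;
}

// Collect every node of the given type, depth first.
function collect(node, type, found = []) {
  if (node.type === type) found.push(node);
  if (node.children) node.children.forEach((child) => collect(child, type, found));
  return found;
}

const words = collect(tree, 'WordNode').map(nodeToString);
console.log(words);              // [ 'A', 'simple', 'sentence' ]
console.log(nodeToString(tree)); // A simple sentence.
```

Because every node is either a parent with children or a TextNode with a value, the same two helpers work at any level of the tree, for example on a single SentenceNode.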

Syntax Tree Format

Note: The easiest way to see how parse-latin tokenizes and parses is to use the online parser demo, which shows the syntax tree corresponding to the typed text.

Basically, parse-latin splits text into white space, word, and punctuation tokens. parse-latin starts out with a pretty easy definition, one that most other tokenizers use:

A “word” is one or more letter or number characters;
A “white space” is one or more white space characters;
A “punctuation” is one or more of anything else;
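
The three definitions above can be sketched with a single regular expression. This is an illustration of the starting definition only, not parse-latin's actual implementation; the function name naiveTokenize and the token shape { type, value } are assumptions made for the example.

```javascript
// Naive tokenizer following the three definitions literally:
// word = letters or numbers, white space = white space, punctuation = anything else.
// Requires a runtime with Unicode property escapes and String#matchAll (Node 12+).
function naiveTokenize(value) {
  const pattern =
    /(?<word>[\p{L}\p{N}]+)|(?<whiteSpace>\s+)|(?<punctuation>[^\p{L}\p{N}\s]+)/gu;
  const tokens = [];

  for (const match of value.matchAll(pattern)) {
    const groups = match.groups;
    const type = groups.word !== undefined
      ? 'word'
      : groups.whiteSpace !== undefined ? 'whiteSpace' : 'punctuation';
    tokens.push({ type, value: match[0] });
  }

  return tokens;
}

const naiveTokens = naiveTokenize("She's here.");
console.log(naiveTokens.map((token) => token.value));
// [ 'She', "'", 's', ' ', 'here', '.' ]
```

Note how the apostrophe splits "She's" into three tokens. That is the kind of case the merging step has to repair.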

Then, it manipulates and merges those tokens into an NLCST syntax tree, adding sentences and paragraphs where needed.

Some punctuation marks are part of the word they occur in, e.g., non-profit, she's, G.I., 11:00, N/A, &c, nineteenth- and...;
Some full-stops do not mark a sentence end, e.g., 1., e.g., id.;
Although full-stops, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, e.g., .), .";
And many more exceptions.
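
To make the first exception concrete, here is a rough sketch of one possible merging rule: a punctuation mark that sits directly between two words is folded into a single word. The rule, the set of marks, and the function name mergeInnerWordPunctuation are all hypothetical. parse-latin's real rules are more extensive and also cover cases this sketch ignores, such as G.I. or a trailing hyphen.

```javascript
// Hypothetical set of marks that may join two words; chosen for illustration only.
const INNER_WORD_MARKS = new Set(["'", '’', '-', ':', '/']);

// Merge word + mark + word into one word token. Returns a new array;
// the input tokens are not modified.
function mergeInnerWordPunctuation(tokens) {
  const merged = [];

  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    const previous = merged[merged.length - 1];
    const next = tokens[i + 1];
    const joinsWords =
      token.type === 'punctuation' &&
      INNER_WORD_MARKS.has(token.value) &&
      previous !== undefined && previous.type === 'word' &&
      next !== undefined && next.type === 'word';

    if (joinsWords) {
      // Fold the mark and the following word into the previous word.
      previous.value += token.value + next.value;
      i++; // skip the word that was just absorbed
    } else {
      merged.push({ type: token.type, value: token.value });
    }
  }

  return merged;
}

const sampleTokens = [
  { type: 'word', value: 'she' },
  { type: 'punctuation', value: "'" },
  { type: 'word', value: 's' },
  { type: 'whiteSpace', value: ' ' },
  { type: 'word', value: 'non' },
  { type: 'punctuation', value: '-' },
  { type: 'word', value: 'profit' },
  { type: 'punctuation', value: '.' }
];

const mergedTokens = mergeInnerWordPunctuation(sampleTokens);
console.log(mergedTokens.map((token) => token.value));
// [ "she's", ' ', 'non-profit', '.' ]
```

The final full stop stays a separate punctuation token because no word follows it, which leaves it available for sentence-boundary detection.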

Benchmark

On a MacBook Air, parse-latin parses 2 large books, 25 big articles, or 2,056 paragraphs per second.

To put things into perspective, Shakespeare’s works contain 884,647 words. I have not tested it, but in theory parse-latin should parse these works in (slightly above) four seconds.

             latin.parse(document);
  2,056 op/s » A paragraph (5 sentences, 100 words)
    267 op/s » A section (10 paragraphs)
     25 op/s » An article (10 sections)
      2 op/s » A (large) book (10 articles)
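
The "slightly above four seconds" estimate can be reproduced from the figures quoted here. The sketch below assumes that the paragraph-level rate (2,056 paragraphs of 100 words per second) holds for the whole corpus. It is an extrapolation, not a measurement.

```javascript
// Figures quoted in the benchmark text.
const paragraphsPerSecond = 2056;
const wordsPerParagraph = 100;
const shakespeareWords = 884647;

// Throughput in words per second at the paragraph-level rate.
const wordsPerSecond = paragraphsPerSecond * wordsPerParagraph; // 205600

// Estimated time to parse all of Shakespeare's works.
const estimatedSeconds = shakespeareWords / wordsPerSecond;

console.log(estimatedSeconds.toFixed(1)); // prints 4.3
```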

Package last updated on 14 Oct 2014
